fix: use name-mapped field IDs for projection and predicate pushdown#2612
Open
viirya wants to merge 1 commit into
Open
fix: use name-mapped field IDs for projection and predicate pushdown#2612viirya wants to merge 1 commit into
viirya wants to merge 1 commit into
Conversation
When a Parquet file lacks embedded field IDs but a name mapping (schema.name-mapping.default) is available, the reader applied the mapping to the Arrow schema yet still planned column projection and predicate pushdown with the position-based fallback (field id N -> column N-1). Columns whose mapped field IDs do not line up with their physical positions were read as NULL, and predicates were evaluated against the wrong physical columns, silently returning wrong rows. Position fallback now applies only when the file has no embedded field IDs AND no name mapping is available, matching Java's three-branch ReadConf strategy. With a name mapping, projection and predicate planning resolve columns through the field IDs the mapping assigned to the Arrow schema. Closes apache#2403
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
What changes are included in this PR?
When a Parquet file lacks embedded field IDs but a name mapping (
schema.name-mapping.default) is available, the reader applies the mapping to the Arrow schema — but it still planned column projection and predicate pushdown with the position-based fallback (field idN→ columnN-1). As a result:This contradicts Java's
ReadConf, which uses a three-branch strategy: embedded IDs → field-id projection; name mapping → apply mapping, then field-id projection; neither → position fallback. Our code applied the name mapping to the Arrow schema but then took the fallback branch anyway.Changes:
pipeline.rs: computeuse_position_fallback = missing_field_ids && task.name_mapping.is_none()and pass it (instead ofmissing_field_ids) to bothget_arrow_projection_maskandbuild_field_id_set_and_map, so position fallback only applies when there are no embedded field IDs and no name mapping.projection.rs:build_field_id_set_and_mapnow distinguishes the name-mapping case. When the Parquet descriptor has no embedded field IDs but a name mapping assigned IDs to the Arrow schema, it builds the field-id → leaf-column map from the Arrow schema'sPARQUET:field_idmetadata (newbuild_field_id_map_from_arrow_schema). Arrow leaf ordering matches Parquet leaf column ordering (both depth-first), the same invariantget_arrow_projection_maskalready relies on forProjectionMask::leaves.Are these changes tested?
Two new regression tests in
projection.rs, covering both broken paths:test_read_parquet_with_name_mapping_uses_mapped_field_ids: a file without field IDs whose columns[name, subdept]map to non-contiguous field IDs (2, 4). On main, projection NULL-fills both columns; with this fix, the values are read correctly.test_predicate_on_name_mapped_file_uses_mapped_field_ids: a predicatename = "Alice"on the same file shape, with row-group filtering and row selection enabled. On main, the predicate is evaluated against the wrong physical column and returns 2 rows; with this fix, it returns the single correct row.Both tests fail on main (verified by reverting the
pipeline.rsstrategy line) and pass with the fix. The fullarrow::unit-test suite (97 tests) passes.